Abstract

We consider the problem of on-line prediction competitive with a
benchmark class of continuous but highly irregular prediction rules. It is
known that if the benchmark class is a reproducing kernel Hilbert space,
there exists a prediction algorithm whose average loss over the first $N$
examples does not exceed the average loss of any prediction rule in the class
plus a “regret term” of $O(N^{-1/2})$. The elements of some natural benchmark
classes, however, are so irregular that these classes are not Hilbert
spaces. In this paper we develop Banach-space methods to construct a
prediction algorithm with a regret term of $O(N^{-1/p})$, where $p \in [2, \infty)$
and $p - 2$ reflects the degree to which the benchmark class fails to be a
Hilbert space.


1 Introduction 

For simplicity, in this introductory section we only discuss the problem of
predicting labels $y_n$ of objects $x_n \in [0,1]$ (this will remain our main example
throughout the paper). In this paper we are mainly interested in extending 
the class of the prediction rules our algorithms are competitive with; in other 
respects, our assumptions are rather restrictive. For example, we always assume 
that the labels yn are bounded in absolute value by a known positive constant Y 
and only consider the problem of square-loss regression (some ideas for extension 
to a wider range of loss functions can be found in |dtip. 

Standard methods allow one to construct a “universally consistent” on-line 
prediction algorithm, i.e., an on-line prediction algorithm whose average loss 
over the first N examples does not exceed the average loss of any continuous 
prediction rule plus $o(1)$. (Such methods were developed in, e.g., |H1, EOI, and,
especially, 0, §3.2; for an explicit statement see m-) More specifically, for any 
reproducing kernel Hilbert space (RKHS) on [0,1] one can construct an on-line 
prediction algorithm whose average loss does not exceed that of any prediction 
rule in the RKHS plus $O(N^{-1/2})$; choosing a universal RKHS (ISSI, Definition
4) gives universal consistency. In this paper we are interested in extending the
latter result, which is much more specific than the $o(1)$ provided by universal
consistency, to wider benchmark classes of prediction rules. First we discuss
limitations of RKHS as benchmark classes. 

The regularity of a prediction rule $D$ can be measured by its “Hölder
exponent” $h$, which is informally defined by the condition that $|D(x + dx) - D(x)|$
scales as $|dx|^h$ for small $|dx|$. The most regular continuous functions are those
of classical analysis: say, piecewise differentiable with bounded derivatives. For
such functions the Hölder exponent is 1. Familiar examples are $x \mapsto \sin x$ and
$x \mapsto |x - 1/2|$. Functions much less regular than those of classical analysis are
ubiquitous in probability theory: for example, typical trajectories of the
Brownian motion (more generally, of non-degenerate diffusion processes) have Hölder
exponent 1/2. Functions with other Hölder exponents $h \in (0,1)$ can be obtained
as typical trajectories of the fractional Brownian motion. Three examples with
different values of $h$ are shown in Figure 1.

The intuition behind the informal notion of a function with Hölder exponent
$h$ will be captured using function spaces known as Sobolev spaces. Roughly,
the Sobolev spaces $W^{s,p}([0,1])$ (defined formally in the next section), where
$p \in (1,\infty]$, $s \in (0,1)$, and $s > 1/p$, can be regarded as different ways of
formalizing the notion of a function on $[0,1]$ with Hölder exponent $h \ge s$.

The most familiar Sobolev spaces are the Hölder spaces $W^{s,\infty}([0,1])$,
consisting of the functions $f$ satisfying $|f(x) - f(y)| = O(|x - y|^s)$. The Hölder
spaces are nested, $W^{s,\infty}([0,1]) \subset W^{s',\infty}([0,1])$ when $s' < s$. (That all Hölder
spaces are very different can be seen from the fact that typical trajectories of
the fractional Brownian motion defined in §3 are in $W^{s,\infty}([0,1])$ for $s < h$
and outside $W^{s,\infty}([0,1])$ for $s > h$.) As we will see in a moment, the standard
Hilbert-space methods only work for $W^{s,\infty}([0,1])$ with $s > 1/2$ as benchmark
classes; our goal is to develop methods that would work for smaller $s$ as well.

The spaces $W^{s,\infty}([0,1])$ are rather awkward analytically and even poorly
reflect the intuitive notion of Hölder exponent: they are defined in terms of
$\sup_{x,y} |f(x) - f(y)| / |x - y|^s$, and so $f$'s behavior in the neighborhood of a single
point can disqualify it from being a member of $W^{s,\infty}([0,1])$. Replacing sup
with the mean (in the sense of $L^p$) w.r. to a natural “almost finite” measure
gives the Sobolev spaces $W^{s,p}([0,1])$ for $p < \infty$. Results for the case $p < \infty$
immediately carry over to $p = \infty$ since, as we will see in the next section,
$W^{s,\infty}([0,1]) \subseteq W^{s',p}([0,1])$ whenever $s' < s$; $s'$ can be arbitrarily close to $s$.



Figure 1: Functions with Holder exponent h for three different values of h. 




All Sobolev spaces (including the Hölder spaces) are Banach spaces, but
$W^{s,2}([0,1])$ are also Hilbert spaces and, for $s > 1/2$, even RKHS. Therefore,
they are amenable to the standard methods (see the papers mentioned above; 
the exposition of m is especially close to that of this paper, although we wrote 

instead of in |T7)1. 

The condition s > 1/p appears indispensable in the development of the 
theory (cf. the reference to the Sobolev imbedding theorem in the next section). 
Since this paper concentrates on the irregular end of the Sobolev spectrum,
$s < 1/2$, instead of the Hilbert spaces $W^{s,2}([0,1])$ we now have to deal with the
Banach spaces $W^{s,p}([0,1])$ with $p \in (2,\infty)$, which are not Hilbert spaces. The
necessary tools are developed in §SHS1 

The methods of m relied on the perfect shape of the unit ball in a Hilbert 
space. If $p$ is not very far from 2, the unit ball in $W^{s,p}([0,1])$ is no longer perfectly
round but still convex enough to allow us to obtain similar results by similar
methods. In principle, the condition $s > 1/p$ is no longer an obstacle to coping
with any $s > 0$: by taking a large enough $p$ we can reach arbitrarily small
$s$. However, the quality of prediction (at least as judged by our bound) will
deteriorate: as we will see (Theorem 2 in the next section), the average loss of our
prediction algorithm does not exceed that of any prediction rule in $W^{s,p}([0,1])$
plus $O(N^{-1/p})$. (This gives a regret term of $O(N^{-s+\epsilon})$ for the prediction rules
in $W^{s,\infty}([0,1])$, where $s < 1/2$ and $\epsilon > 0$.)

2 Main result 

We consider the following perfect-information prediction protocol: 

FOR $n = 1,2,\ldots$:
  Reality announces $x_n \in \mathbf{X}$.
  Predictor announces $\mu_n \in \mathbb{R}$.
  Reality announces $y_n \in [-Y,Y]$.
END FOR.

At the beginning of each round $n$ Predictor is given an object $x_n$ whose label is
to be predicted. The set of a priori possible objects, the object space, is denoted
$\mathbf{X}$; we always assume $\mathbf{X} \ne \emptyset$. After Predictor announces his prediction $\mu_n$ for
the object's label he is shown the actual label $y_n \in [-Y,Y]$. We consider the
problem of regression, $y_n \in \mathbb{R}$, assuming an upper bound $Y$ on $|y_n|$. The pairs
$(x_n, y_n)$ are called examples.

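Predictor's loss on round $n$ is measured by $(y_n - \mu_n)^2$, and so his average
loss after $N$ rounds of the game is $\frac{1}{N} \sum_{n=1}^N (y_n - \mu_n)^2$. His goal is to have

  $\frac{1}{N} \sum_{n=1}^N (y_n - \mu_n)^2 \lesssim \frac{1}{N} \sum_{n=1}^N (y_n - D(x_n))^2$

($\lesssim$ meaning “is less than or approximately equal to”) for each prediction rule
$D : \mathbf{X} \to \mathbb{R}$ that is not “too wild”.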
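The protocol and the comparison of average losses can be sketched in code. This is only an illustration and not part of the paper: the function names, the toy strategy `last_label_predictor`, and the sample data are hypothetical, chosen to show how Predictor's average square loss and that of a fixed rule $D$ are accumulated round by round.

```python
from typing import Callable, List, Sequence, Tuple

Example = Tuple[float, float]  # (object x_n, label y_n)


def average_losses(
    examples: Sequence[Example],
    predictor: Callable[[List[Example], float], float],
    rule: Callable[[float], float],
) -> Tuple[float, float]:
    """Play the protocol once; return (Predictor's average loss, rule's average loss)."""
    history: List[Example] = []
    predictor_total = 0.0
    rule_total = 0.0
    for x, y in examples:
        mu = predictor(history, x)        # Predictor moves before seeing y_n
        predictor_total += (y - mu) ** 2  # square loss of Predictor on round n
        rule_total += (y - rule(x)) ** 2  # square loss of the rule D on round n
        history.append((x, y))            # the label is revealed only now
    n = len(examples)
    return predictor_total / n, rule_total / n


def last_label_predictor(history: List[Example], x: float) -> float:
    """Toy strategy (hypothetical): repeat the previous label, 0 on the first round."""
    return history[-1][1] if history else 0.0


examples = [(0.0, 1.0), (0.5, 1.0), (1.0, -1.0)]
predictor_avg, rule_avg = average_losses(examples, last_label_predictor, lambda x: 0.0)
print(predictor_avg, rule_avg)
```

The difference `predictor_avg - rule_avg` is the quantity that the regret terms in this paper bound from above.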
Main theorem

Our main theorem will be fairly general and applicable to a wide range of Banach 
function spaces. Its implications for Sobolev spaces will be explained after its 
statement. 

Let $U$ be a Banach space and $S_U := \{u \in U \mid \|u\|_U = 1\}$ be the unit sphere
in $U$. Our methods are applicable only to Banach spaces whose unit spheres do
not have very flat areas; a convenient measure of rotundity of $S_U$ is Clarkson's
cni modulus of convexity 


  $\delta_U(\epsilon) := \inf_{u,v \in S_U,\ \|u-v\|_U = \epsilon} \left( 1 - \left\| \frac{u+v}{2} \right\|_U \right), \quad \epsilon \in (0,2]$  (1)


(we will be mostly interested in the small values of e). 

Let us say that a Banach space $\mathcal{F}$ of real-valued functions $f$ on $\mathbf{X}$ (with
the standard pointwise operations of addition and scalar multiplication) is a
proper Banach functional space (PBFS) on $\mathbf{X}$ if, for each $x \in \mathbf{X}$, the evaluation
functional $k_x : f \in \mathcal{F} \mapsto f(x)$ is continuous. We will assume that

  $c_{\mathcal{F}} := \sup_{x \in \mathbf{X}} \|k_x\|_{\mathcal{F}^*} < \infty$,  (2)

where $\mathcal{F}^*$ is the dual Banach space (see, e.g., |3J, Chapter 4).

The following theorem will be proved in §2HS1 

Theorem 1 Let $\mathcal{F}$ be a proper Banach functional space such that

  $\forall \epsilon \in (0,2]: \ \delta_{\mathcal{F}}(\epsilon) \ge (\epsilon/2)^p / p$  (3)

for some $p \in [2,\infty)$. There exists a prediction algorithm producing $\mu_n \in [-Y,Y]$
that are guaranteed to satisfy

1 ^ 1 " I - 

- ^ (y„ - ^ E (y" - 1 

n—l n—1 

(4) 

for all $N = 1,2,\ldots$ and all $D \in \mathcal{F}$.

Conditions (2) and (3) are satisfied for the Sobolev spaces $W^{s,p}(\mathbf{X})$, which we
will now define.


Sobolev spaces 

Suppose $\mathbf{X}$ is an open or closed set in $\mathbb{R}^m$. (The standard theory assumes that
$\mathbf{X}$ is open, but the results we need easily extend to closed $\mathbf{X}$.) We only define
the Sobolev spaces $W^{s,p}(\mathbf{X})$ for the cases $s \in (0,1)$ and $p > m/s$; for a more
general definition see, e.g., EZI (pp. 57, 61) or ^ (Theorem 7.48 and Remark 
7.49). 
Let $s \in (0,1)$ and $p > m/s$. For a function $f \in L^p(\mathbf{X})$ define

  $\|f\|_{s,p} := \left( \int_{\mathbf{X}} |f(x)|^p \, dx + \int_{\mathbf{X}} \int_{\mathbf{X}} \left| \frac{f(x) - f(y)}{|x-y|^s} \right|^p \frac{dx \, dy}{|x-y|^m} \right)^{1/p}$  (5)


(we use $|\cdot|$ to denote the Euclidean norm in $\mathbb{R}^m$). The Sobolev space $W^{s,p}(\mathbf{X})$
is defined to be the set of all $f$ such that $\|f\|_{s,p} < \infty$. The Sobolev imbedding
theorem says that, for a wide range of $\mathbf{X}$ (definitely including our main example
$\mathbf{X} = [0,1] \subseteq \mathbb{R}$), the functions in $W^{s,p}(\mathbf{X})$ can be made continuous by a change
on a set of measure zero; we will always assume that this is true for our object
space $\mathbf{X}$ and consider the elements of $W^{s,p}(\mathbf{X})$ to be continuous functions. Let
$C(\mathbf{X})$ be the Banach space of continuous functions $f : \mathbf{X} \to \mathbb{R}$ with finite norm
$\|f\|_{C(\mathbf{X})} := \sup_{x \in \mathbf{X}} |f(x)|$. The Sobolev imbedding theorem also says that the
imbedding $W^{s,p}(\mathbf{X}) \hookrightarrow C(\mathbf{X})$ (i.e., the function that maps each $f \in W^{s,p}(\mathbf{X})$
to the same function but considered as an element of $C(\mathbf{X})$) is continuous, i.e.,
that

  $c_{s,p} := c_{W^{s,p}(\mathbf{X})} < \infty$;

notice that $c_{s,p}$ is just the norm of the imbedding $W^{s,p}(\mathbf{X}) \hookrightarrow C(\mathbf{X})$. These
conclusions depend on the condition p > m/s (there are other parts of the 
Sobolev imbedding theorem, dealing with the case where this condition is not 
satisfied). For a proof in the case X = R™, see, e.g., [ 21 , Theorems 7.34(c) and 
7.47(a,c); this implies the analogous statement for X with smooth boundary 
since for such $\mathbf{X}$ every $f \in W^{s,p}(\mathbf{X})$ can be extended to an element of $W^{s,p}(\mathbb{R}^m)$
without increasing the norm by more than a constant factor (see, e.g., [27], p. 81).
We will say “domain” to mean a subset of $\mathbb{R}^m$ which satisfies the conditions of
regularity mentioned in this paragraph.

The norm (5) (sometimes called the Sobolev-Slobodetsky norm) is only one
of the standard norms giving rise to the same topological vector space, and
the term “Sobolev space” is usually used to refer to the topology rather than a
specific norm; in this paper we will not consider any other norms. The restriction
$s \in (0,1)$ is not essential for the results in this paper, but the definition of $\|\cdot\|_{s,p}$
becomes slightly more complicated when s > 1 (cf. 123 ); 0 gives a different 
but equivalent norm. 
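To make the Sobolev-Slobodetsky norm concrete, the following sketch approximates it numerically for $\mathbf{X} = [0,1]$ (so $m = 1$) with a midpoint rule. It is an illustration only: the function name `slobodetsky_norm`, the grid size, and the decision to skip the diagonal cells of the double integral are assumptions of this sketch, and the result is a crude approximation rather than an exact value.

```python
def slobodetsky_norm(f, s, p, grid=200):
    """Midpoint-rule approximation of the Sobolev-Slobodetsky norm on [0, 1] (m = 1)."""
    h = 1.0 / grid
    xs = [(i + 0.5) * h for i in range(grid)]
    values = [f(x) for x in xs]
    # First term: integral of |f(x)|^p over [0, 1].
    lp_part = sum(abs(v) ** p for v in values) * h
    # Second term: double integral of |f(x) - f(y)|^p / |x - y|^(s*p + m), m = 1.
    seminorm_part = 0.0
    for i in range(grid):
        for j in range(grid):
            if i == j:
                continue  # the integrand is singular on the diagonal; skip those cells
            dist = abs(xs[i] - xs[j])
            seminorm_part += abs(values[i] - values[j]) ** p / dist ** (s * p + 1) * h * h
    return (lp_part + seminorm_part) ** (1.0 / p)


constant_norm = slobodetsky_norm(lambda x: 1.0, s=0.6, p=2)
linear_norm = slobodetsky_norm(lambda x: x, s=0.6, p=2)
doubled_norm = slobodetsky_norm(lambda x: 2.0 * x, s=0.6, p=2)
print(constant_norm, linear_norm, doubled_norm)
```

For a constant function the double integral vanishes, so only the $L^p$ part remains; for $f(x) = x$ with $s = 0.6$ and $p = 2$ both integrals can be computed by hand, giving $\sqrt{1/3 + 2/(0.8 \cdot 1.8)} \approx 1.31$, which the approximation should roughly reproduce.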

For comparison purposes we will also define the spaces $W^{1,p}([0,1])$, $p \in (1,\infty)$: set

  $\|f\|_{1,p} := \left( \int_0^1 \left( |f(x)|^p + |f'(x)|^p \right) dx \right)^{1/p}$

and include in $W^{1,p}([0,1])$ all absolutely continuous functions $f : [0,1] \to \mathbb{R}$
with $\|f\|_{1,p} < \infty$. We will always assume $\mathbf{X} = [0,1]$ in the case $s = 1$.

We can now deduce the following corollary from Theorem 1. It is known
that (3) is satisfied for the Sobolev spaces $W^{s,p}(\mathbf{X})$ (see the discussion of
convexity and smoothness for Sobolev spaces in §4).

Theorem 2 Let $p \in [2,\infty)$ and $s \in (m/p, 1)$. There exists a constant $C_{s,p} > 0$
and a prediction algorithm producing $\mu_n \in [-Y,Y]$ that are guaranteed to satisfy

  $\frac{1}{N} \sum_{n=1}^N (y_n - \mu_n)^2 \le \frac{1}{N} \sum_{n=1}^N (y_n - D(x_n))^2 + Y C_{s,p} \left( \|D\|_{s,p} + Y \right) N^{-1/p}$  (6)

for all $N = 1,2,\ldots$ and all $D \in W^{s,p}(\mathbf{X})$.

In informal discussions below we will continue to call terms such as the
second addend on the right-hand side of (6) the “regret term”, and say that the
corresponding prediction algorithm is “$R$-competitive”, where $R$ is the regret
term.

According to 0, we can take 

Cs,p = c^p + 1 , 

but in fact 

  $C_{s,p} = 4 \times 8.68^{1-1/p} \sqrt{c_{s,p}^2 + 1}$  (7)

will suffice (see ll^ below). In the special case p = 2 one can use Hilbert-space 

methods to improve (7), which now becomes, approximately,

  $11.78 \sqrt{c_{s,2}^2 + 1}$,  (8)

to

  $2 \sqrt{c_{s,2}^2 + 1}$  (9)

m, Theorem 1); using Banach-space methods we have lost a factor of 5.89. For 
example, in the case $s = 1$, (8) gives $C_{s,p} \approx 17.92$ and (9) gives $C_{s,p} \approx 3.04$ (the
value $c_{1,2}^2 = \coth 1$ was found in ESI; for further details of the case $s = 1$, $p = 2$,
see 123, §4). 
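The numerical constants quoted above can be checked with a few lines of arithmetic. The sketch below assumes the readings $4 \times 8.68^{1-1/p} \sqrt{c_{s,p}^2 + 1}$ for (7), $2\sqrt{c_{s,2}^2 + 1}$ for (9), and $c_{1,2}^2 = \coth 1$; the helper names are hypothetical.

```python
import math


def banach_constant(c_squared: float, p: float) -> float:
    """Constant (7): 4 * 8.68**(1 - 1/p) * sqrt(c^2 + 1)."""
    return 4 * 8.68 ** (1 - 1 / p) * math.sqrt(c_squared + 1)


def hilbert_constant(c_squared: float) -> float:
    """Constant (9), available when p = 2: 2 * sqrt(c^2 + 1)."""
    return 2 * math.sqrt(c_squared + 1)


coth_1 = math.cosh(1.0) / math.sinh(1.0)          # c_{1,2}^2 in the case s = 1, p = 2

factor_p2 = 4 * 8.68 ** 0.5                        # coefficient in (8)
lost_factor = factor_p2 / 2                        # ratio of (8) to (9)
banach_value = banach_constant(coth_1, p=2)        # value of (8) for s = 1
hilbert_value = hilbert_constant(coth_1)           # value of (9) for s = 1

print(round(factor_p2, 2), round(lost_factor, 2))
print(round(banach_value, 2), round(hilbert_value, 2))
```

The ratio of the two constants does not depend on $c_{s,2}$, which is why the loss from using Banach-space methods at $p = 2$ can be stated as a single factor.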

Application to the Holder-continuous functions 

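An important limiting case of the norm (5) is

  $\|f\|_{s,\infty} := \max \left( \sup_{x \in \mathbf{X}} |f(x)|, \ \sup_{x,y \in \mathbf{X}: x \ne y} \frac{|f(x) - f(y)|}{|x-y|^s} \right)$,

where $f : \mathbf{X} \to \mathbb{R}$ is, as usual, assumed continuous. The space $W^{s,\infty}(\mathbf{X})$
consists of the functions $f$ with $\|f\|_{s,\infty} < \infty$, and its elements are called Hölder
continuous of order $s$.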

The Hölder-continuous functions of order $s$ are perhaps the most intuitive
formalization of the functions with Hölder exponent $h \ge s$. Let us see what
Theorem 2 gives for them.

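Suppose that $\mathbf{X}$ is a bounded domain in $\mathbb{R}^m$, $p \in (1,\infty)$, and $s, s' \in (0,1)$
are such that $s' < s$. If $f \in W^{s,\infty}(\mathbf{X})$,

  $\|f\|_{s',p} = \left( \int_{\mathbf{X}} |f(x)|^p \, dx + \int_{\mathbf{X}} \int_{\mathbf{X}} \frac{|f(x) - f(y)|^p}{|x-y|^{s'p}} \frac{dx \, dy}{|x-y|^m} \right)^{1/p}$

  $\le \left( C^p |\mathbf{X}| + \int_{\mathbf{X}} \int_{\mathbf{X}} \frac{(C |x-y|^s)^p}{|x-y|^{s'p}} \frac{dx \, dy}{|x-y|^m} \right)^{1/p}$

  $= \left( C^p |\mathbf{X}| + C^p \int_{\mathbf{X}} \int_{\mathbf{X}} |x-y|^{-m+sp-s'p} \, dx \, dy \right)^{1/p}$

  $\le \left( C^p |\mathbf{X}| + C^p |\mathbf{X}| \frac{m \pi^{m/2}}{\Gamma(m/2+1)} \int_0^{\mathrm{diam}\,\mathbf{X}} t^{-m+sp-s'p} \, t^{m-1} \, dt \right)^{1/p}$

  $= C \left( |\mathbf{X}| \left( 1 + \frac{m \pi^{m/2} (\mathrm{diam}\,\mathbf{X})^{(s-s')p}}{\Gamma(m/2+1) \, (s-s')p} \right) \right)^{1/p}$,  (10)

where $C := \|f\|_{s,\infty}$, $|\mathbf{X}|$ stands for the volume (Lebesgue measure) of $\mathbf{X}$, and
$\mathrm{diam}\,\mathbf{X}$ stands for the diameter of $\mathbf{X}$; remember that $\pi^{m/2} / \Gamma(m/2+1)$ is the
volume of the unit ball in $\mathbb{R}^m$. Therefore, (10) gives an explicit bound for the
norm of the continuous imbedding $W^{s,\infty}(\mathbf{X}) \hookrightarrow W^{s',p}(\mathbf{X})$.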

Fix an arbitrarily small $\epsilon > 0$. Applying (6) to $W^{s',p}(\mathbf{X})$ with $p > m/s$
sufficiently close to $m/s$ and to $s' \in (m/p, s)$, we can see from (10) that there
exists a constant $C_{s,\epsilon} > 0$ such that

  $\frac{1}{N} \sum_{n=1}^N (y_n - \mu_n)^2 \le \frac{1}{N} \sum_{n=1}^N (y_n - D(x_n))^2 + Y C_{s,\epsilon} \left( \|D\|_{s,\infty} + Y \right) N^{-s/m+\epsilon}$  (11)

holds for all $N = 1,2,\ldots$ and all $D \in W^{s,\infty}(\mathbf{X})$.


3 Implications for a stochastic Reality 

In this section we discuss implications of Theorem 2 for statistical learning
theory and filtering of random processes. Surprisingly, even when Reality follows 
a specific stochastic strategy, competitive on-line results do not trivialize but 
provide new meaningful information. 

Statistical learning theory 

In this section we apply the method of |H] to derive a corollary of Theorem 2
for the statistical learning framework, where $(x_n, y_n)$ are assumed to be drawn
independently from the same probability distribution on $\mathbf{X} \times [-Y,Y]$.

The risk of a prediction rule (formally, a measurable function) $D : \mathbf{X} \to \mathbb{R}$
with respect to a probability distribution $P$ on $\mathbf{X} \times [-Y,Y]$ is defined as

  $\mathrm{risk}_P(D) := \int_{\mathbf{X} \times [-Y,Y]} (y - D(x))^2 \, P(dx, dy)$.

Our current goal is to construct, from a given sample, a prediction rule whose
risk is competitive with the risk of small-norm prediction rules in $W^{s,p}(\mathbf{X})$.

Fix an on-line prediction algorithm and a sequence $(x_1, y_1), (x_2, y_2), \ldots$ of
examples. For each $n = 1,2,\ldots$ and each $x \in \mathbf{X}$, define $H_n(x)$ to be the
prediction $\mu_n \in \mathbb{R}$ output by the algorithm when fed with $(x_1, y_1), \ldots, (x_{n-1}, y_{n-1}), x$.

We will assume that the functions $H_n$ are always measurable (they are for our
algorithm, constructed in the following two sections). The prediction rule

  $\bar{H}_N(x) := \frac{1}{N} \sum_{n=1}^N H_n(x)$

will be said to be obtained by averaging from the prediction algorithm.
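The averaging construction can be sketched as follows. This is an illustration under stated assumptions: an on-line algorithm is modelled as a function of the past examples and the new object, and the toy algorithm `running_mean` is hypothetical, not the algorithm constructed in this paper.

```python
from typing import Callable, List, Sequence, Tuple

Example = Tuple[float, float]                         # (object, label)
Algorithm = Callable[[List[Example], float], float]   # (past examples, new object) -> prediction


def averaged_rule(algorithm: Algorithm, examples: Sequence[Example], N: int) -> Callable[[float], float]:
    """Return the rule x -> (1/N) * sum_{n=1..N} H_n(x), where H_n sees the first n-1 examples."""
    def h_bar(x: float) -> float:
        total = 0.0
        for n in range(1, N + 1):
            total += algorithm(list(examples[: n - 1]), x)  # this is H_n(x)
        return total / N
    return h_bar


def running_mean(history: List[Example], x: float) -> float:
    """Toy algorithm (hypothetical): predict the mean of the past labels, 0 if there are none."""
    if not history:
        return 0.0
    return sum(y for _, y in history) / len(history)


sample = [(0.1, 1.0), (0.4, 3.0), (0.9, 5.0)]
h_bar = averaged_rule(running_mean, sample, N=3)
print(h_bar(0.5))
```

Here $H_1 = 0$, $H_2 = 1$ and $H_3 = 2$ (this toy algorithm ignores the object), so the averaged rule returns 1.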

The following result is an easy application of the method of to ; we 
refrain from stating the analogous result based on CD- 

Corollary 1 Let $\mathbf{X}$ be a domain in $\mathbb{R}^m$, $p \ge 2$, $s \in (m/p, 1)$, and let $\bar{H}_N$,
$N = 1,2,\ldots$, be the prediction rule obtained by averaging from some prediction
algorithm guaranteeing (6). For any $D \in W^{s,p}(\mathbf{X})$, any probability distribution
$P$ on $\mathbf{X} \times [-Y,Y]$, any $N = 1,2,\ldots$, and any $\delta > 0$,

  $\mathrm{risk}_P(\bar{H}_N) \le \mathrm{risk}_P(D) + Y C_{s,p} \left( \|D\|_{s,p} + Y \right) N^{-1/p} + 4 Y^2 \sqrt{2 \ln \frac{2}{\delta}} \, N^{-1/2}$  (12)

with probability at least $1 - \delta$.


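Proof Without loss of generality we assume that $D(x) \in [-Y,Y]$ for all $x \in \mathbf{X}$
and that $H_n(x) \in [-Y,Y]$ for all $x \in \mathbf{X}$ and $n$. Outside an event of probability

  $\delta := 2 \exp \left( - \frac{\epsilon^2 N}{8 Y^4} \right)$  (13)

we have (some steps will be explained later on)

  $\mathrm{risk}_P(\bar{H}_N) \le \frac{1}{N} \sum_{n=1}^N \mathrm{risk}_P(H_n)$  (14)

  $\le \frac{1}{N} \sum_{n=1}^N (y_n - H_n(x_n))^2 + \epsilon$  (15)

  $\le \frac{1}{N} \sum_{n=1}^N (y_n - D(x_n))^2 + Y C_{s,p} \left( \|D\|_{s,p} + Y \right) N^{-1/p} + \epsilon$  (16)

  $\le \frac{1}{N} \sum_{n=1}^N \mathrm{risk}_P(D) + Y C_{s,p} \left( \|D\|_{s,p} + Y \right) N^{-1/p} + 2\epsilon$  (17)

  $= \mathrm{risk}_P(D) + Y C_{s,p} \left( \|D\|_{s,p} + Y \right) N^{-1/p} + 2\epsilon$.  (18)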
The first inequality, (14), follows from the convexity of the function $t \mapsto t^2$.
Inequalities (15) and (17) follow from Hoeffding's martingale inequality (m ; see
also E], Theorem 9.1 on p. 135). Either of (15) and (17) holds with probability
at least $1 - \delta/2$; therefore, both will hold with probability at least $1 - \delta$. Finally,
inequality (16) follows from (6).

Our goal, (12), follows from the inequality between the extreme terms of
(14)-(18) if we substitute

  $\epsilon = 2 Y^2 \sqrt{2 \ln \frac{2}{\delta}} \, N^{-1/2}$  (19)

(which is a different way of writing (13)). ∎
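The correspondence between the deviation $\epsilon$ and the failure probability $\delta$ can be checked numerically. The sketch assumes that (19) reads $\epsilon = 2Y^2 \sqrt{2 \ln(2/\delta)} \, N^{-1/2}$ and that (13) is its inverse, $\delta = 2 \exp(-\epsilon^2 N / (8Y^4))$, which is what solving (19) for $\delta$ gives; the function names are hypothetical.

```python
import math


def epsilon_from_delta(delta: float, N: int, Y: float) -> float:
    """Deviation (19): 2 * Y^2 * sqrt(2 * ln(2 / delta)) / sqrt(N)."""
    return 2 * Y ** 2 * math.sqrt(2 * math.log(2 / delta)) / math.sqrt(N)


def delta_from_epsilon(epsilon: float, N: int, Y: float) -> float:
    """Failure probability obtained by solving (19) for delta."""
    return 2 * math.exp(-epsilon ** 2 * N / (8 * Y ** 4))


eps = epsilon_from_delta(delta=0.05, N=1000, Y=1.0)
recovered_delta = delta_from_epsilon(eps, N=1000, Y=1.0)
print(eps, recovered_delta)
```

The third addend of (12) is $2\epsilon$, so for fixed $\delta$ it decays as $N^{-1/2}$.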

For a fixed $\delta$, the regret term (the sum of the second and third addends on
the right-hand side) of (12) grows as $O(N^{-1/p})$. For a discussion of related results

in statistical learning theory, see m (versions 1 and 2), §5. 

Filtering of random processes 

Suppose we are interested in the value of a “signal” $\Theta : [0,1] \to \mathbb{R}$ sequentially
observed at moments $t_n := n/N$, $n = 1,\ldots,N$, where $N$ is a large positive
integer; let $\theta_n := \Theta(t_n)$. The problem is that our observations of $\theta_n$ are
imperfect, and in fact we see $y_n = \theta_n + \xi_n$, where each noise random variable $\xi_n$ has
zero expectation given the past. We assume that $\Theta$ belongs to $W^{s,p}([0,1])$ (but
do not make any assumptions about the mechanism, deterministic, stochastic,
or other, that generated it) and that $\theta_n, y_n \in [-Y,Y]$ for a known constant
$Y$. Let us use the $\mu_n$ from Theorem 2 as estimates of the true values $\theta_n$. The
elementary equality

  $a^2 = (a - b)^2 - b^2 + 2ab$  (20)

implies

  $\sum_{n=1}^N (\mu_n - \theta_n)^2 = \sum_{n=1}^N (y_n - \mu_n)^2 - \sum_{n=1}^N (y_n - \theta_n)^2 + 2 \sum_{n=1}^N (y_n - \theta_n)(\mu_n - \theta_n)$.  (21)

Hoeffding's inequality in the martingale form shows that, for any $C > 0$,

  $P \left\{ \frac{2}{N} \sum_{n=1}^N (y_n - \theta_n)(\mu_n - \theta_n) \ge C \right\} \le \exp \left( - \frac{C^2 N}{128 Y^4} \right)$.

Substituting this (with $C$ expressed via the right-hand side, denoted $\delta$) and (6)
into (21) (divided by $N$), we obtain the following corollary, which we state somewhat informally.


Corollary 2 Let $p \ge 2$, $s \in (1/p, 1)$, and $\delta > 0$. Suppose that $\Theta \in W^{s,p}([0,1])$
and $y_n = \theta_n + \xi_n \in [-Y,Y]$, where $\theta_n := \Theta(n/N) \in [-Y,Y]$ and $\xi_n$ are
random variables whose expectation given the past (including $\theta_n$) is zero. With
probability at least $1 - \delta$ the $\mu_n$ of Theorem 2 satisfy

  $\frac{1}{N} \sum_{n=1}^N (\mu_n - \theta_n)^2 \le Y C_{s,p} \left( \|\Theta\|_{s,p} + Y \right) N^{-1/p} + 8 Y^2 \sqrt{2 \ln \frac{1}{\delta}} \, N^{-1/2}$.  (22)

The constant $C_{s,p}$ in (22) is the one in (6). From (11), we can also see that, if
we assume $\Theta \in W^{s,\infty}([0,1])$,

  $\frac{1}{N} \sum_{n=1}^N (\mu_n - \theta_n)^2 \le Y C_{s,\epsilon} \left( \|\Theta\|_{s,\infty} + Y \right) N^{-s+\epsilon} + 8 Y^2 \sqrt{2 \ln \frac{1}{\delta}} \, N^{-1/2}$  (23)

will hold with probability at least $1 - \delta$.

It is important that the function $\Theta$ in (22) and (23) does not have to
be chosen in advance: it can be constructed “step-wise”, with $\Theta(t)$ for $t \in
(n/N, (n+1)/N]$ chosen at will after observing $y_n$ and taking into account all
other information that becomes available before and including time $n/N$. A
clean formalization of this intuitive picture seems to require the game-theoretic 
probability of CD (although we can get the picture “almost right” using the 
standard measure-theoretic probability). 

In the case where $\Theta$ is generated from a diffusion process, it will almost
surely belong to $W^{1/2-\epsilon,\infty}([0,1])$ (this follows from standard results about the
Brownian motion, such as Lévy's modulus theorem: see, e.g., CD, Theorem
9.25), and so the regret term in (22) and (23) can be made $O(N^{-1/2+\epsilon})$ for an

arbitrarily small $\epsilon > 0$. The Kalman filter, which is stochastically optimal, gives
a somewhat better regret, $O(N^{-1/2})$; Corollary 2, however, does not depend
on the very specific assumptions of the Kalman filter: we do not require the
linearity, Gaussianity, or even stochasticity of the model; the assumption about
the noise $\xi_n$ is minimal (zero expectation given the past). Instead, we have
the assumption that all $\theta_n$ and $y_n$ are chosen from $[-Y,Y]$. It appears that
in practice the interval to which the $\theta_n$ and $y_n$ are assumed to belong should
change slowly as new data are processed. This is analogous to the situation 
with the Kalman filter, which, despite assuming linear systems, has found its
greatest application to non-linear systems d; what is usually used in practice 
is the “extended Kalman filter”, which relies on a slowly changing linearization
of the non-linear system. 

Until the end of this section we will discuss in more detail the standard
stochastic approach to the problem of filtering (CDj ^.Iso |d, ED, §VI.7,
and, for a continuous-time version, CHI, d, §10.1). The signal is now
modeled as a random process $\Theta_t$, $t \in [0,1]$, governed by the stochastic differential
equation

  $d\Theta_t = (a_0(t) + a_1(t) \Theta_t) \, dt + b(t) \, dB_t$,  (24)

where $B_t$ is the standard Brownian motion (a zero-mean Gaussian continuous
stochastic process on $[0,1]$ such that $B_0 = 0$ and the variance of each increment
$B_{t_1} - B_{t_2}$ is $|t_1 - t_2|$) and $a_0, a_1, b : [0,1] \to \mathbb{R}$ are bounded Borel functions.
The process starts from a random value $\Theta_0$ (modeled as a Gaussian random
variable independent of $B_t$) and, as before, is observed at points $t_n := n/N$;
$\theta_n := \Theta(t_n)$. The observed sequence is $y_n = \theta_n + \sigma \xi_n$ (neither $\theta_n$ nor $y_n$ are
assumed to be bounded by a known constant), where $\sigma$ is a positive constant
and $\xi_n$ are standard Gaussian random variables independent between themselves
and of the initial position $\Theta_0$ and the Brownian motion $B_t$. In some important
respects this is a simplification of the usual filtering problems; e.g., we consider
scalar rather than vector $\Theta_t$ and $y_n$.

Earlier we discussed the possibility of positive contributions of competitive
on-line results, such as Theorem 2, to the problem of filtering, and now we will
briefly explore the connection in the opposite direction: limitations on
competitive on-line prediction following from the known optimality properties of the
Kalman filter. According to (11), there is a prediction algorithm $O(N^{-s+\epsilon})$-competitive
with $W^{s,\infty}([0,1])$, for any $\epsilon > 0$. It remains an open problem to
show that the rate $N^{-s}$ (we will disregard plus or minus $\epsilon$ in the rest of this
section) cannot be improved, but the following considerations make it likely in
the case $s \approx 1/2$. (For an alternative argument, see, e.g., Theorem 4 in |:i7|.)

Suppose the prediction rule $D : [0,1] \to \mathbb{R}$ is generated randomly as the
trajectory of the stochastic process (24) with $\Theta_0 = 0$, $a_0(t) = 0$, $a_1(t) = 0$, and
$b(t) = c > 0$ (i.e., $D(t) = cB_t$, where $B$ is the standard Brownian motion). The
positive constant $c$ is chosen small as compared to $Y$, so that $D(t)$ is unlikely
to take values approaching $-Y$ or $Y$. It is clear that the observations $y_n$ are
generated independently (given $B$) from the normal distribution $N(D(t_n), \sigma^2)$
with mean $D(t_n)$ and variance $\sigma^2$; if $y_n$ falls outside $[-Y,Y]$, it is truncated to
$Y \,\mathrm{sign}\, y_n$. The variance $\sigma^2 > 0$ is assumed to be small enough for the probability
of $|y_n| \le Y$ to be close to 1 for each $n$ (or we can even take $c$ and $\sigma$ slightly,
say logarithmically, dependent on $N$ so that $\max_n |y_n| \le Y$ with a probability
tending to 1). According to the standard properties of the Kalman filter (see,
e.g., i2ni, Theorem 13.4, or m, Theorem VI.7.1), the variance $\gamma_n$ of the best
estimate of $\theta_n$ (which is also the best estimate of $y_n$), $n \ge 1$, given $y_1, \ldots, y_{n-1}$
satisfies the recurrent equation

  $\gamma_{n+1} = \gamma_n + \frac{c^2}{N} - \frac{\gamma_n^2}{\sigma^2 + \gamma_n}$.

It is clear that $\gamma_n$ is an increasing sequence tending, as $n \to \infty$, to a limit equal
to

  $\frac{c^2 + \sqrt{c^4 + 4 c^2 \sigma^2 N}}{2N} \sim \frac{c\sigma}{\sqrt{N}}$,

and that it will move significantly towards this limit already during the first
$\sqrt{N}$ rounds (cf. Figure 2). By Hoeffding's inequality, the excess of the total loss
of the stochastically best algorithm (the Kalman filter) over the total loss of $D$
will be of order $N^{1/2}$, and so the excess of its average loss will be of order $N^{-1/2}$
(with probability very close to 1).
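The behaviour of the error variance can be explored numerically. The sketch below assumes the recursion $\gamma_{n+1} = \gamma_n + c^2/N - \gamma_n^2/(\sigma^2 + \gamma_n)$ and the starting value $\gamma_1 = c^2/N$ (the variance of $\theta_1$ when $\Theta_0 = 0$); the starting value and the function names are assumptions of this illustration.

```python
import math
from typing import List


def kalman_error_variances(c: float, sigma: float, N: int, steps: int) -> List[float]:
    """Iterate gamma_{n+1} = gamma_n + c^2/N - gamma_n^2 / (sigma^2 + gamma_n)."""
    gamma = c * c / N          # assumed gamma_1: variance of theta_1 given no observations
    variances = [gamma]
    for _ in range(steps - 1):
        gamma = gamma + c * c / N - gamma * gamma / (sigma * sigma + gamma)
        variances.append(gamma)
    return variances


def limiting_variance(c: float, sigma: float, N: int) -> float:
    """Positive fixed point of the recursion: root of g^2 - (c^2/N) g - c^2 sigma^2 / N = 0."""
    a = c * c / N
    return (a + math.sqrt(a * a + 4 * a * sigma * sigma)) / 2


gammas = kalman_error_variances(c=1.0, sigma=1.0, N=100, steps=100)
limit = limiting_variance(c=1.0, sigma=1.0, N=100)
print(gammas[0], gammas[9], gammas[-1], limit)
```

With $c = \sigma = 1$ and $N = 100$ the fixed point is close to $c\sigma/\sqrt{N} = 0.1$, and summing the variances over $N$ rounds gives a total of order $\sqrt{N}$, in line with the orders of magnitude stated above.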

Since the sample paths of diffusion processes almost surely belong to
$W^{s,\infty}([0,1])$ for all $s \in (0,1/2)$, we can see that no prediction algorithm can be
$O(N^{-1/2-\epsilon})$-competitive with $W^{1/2-\epsilon,\infty}([0,1])$. Therefore, if we disregard the
epsilons, our algorithm achieves the optimal rate of decay in $N$ of the regret
term for $s \approx 1/2$.

Figure 2: The growth of the Kalman filter's error $\gamma_n$, $n = 1,\ldots,N$, for $c =
\sigma^2 = 1$ and $N = 100$; the final value $\gamma_N$ is approximately 0.105.

A similar argument might have also worked in the case $s < 1/2$ had we known
an analogue of the Kalman filter result for the fractional Brownian motion,
where $B$ is replaced with a stochastic process $B^{(h)}$, $h \in (0,1/2)$, defined in the
same way except that the variance of each increment $B^{(h)}_{t_1} - B^{(h)}_{t_2}$ is $|t_1 - t_2|^{2h}$
(notice that $B = B^{(1/2)}$). Unfortunately, we know of no such result, although a

step in this direction is made in m- 


4 More geometry of Banach spaces 

In the proof of Theorem 1 we will need not only Clarkson's modulus of convexity
(1) but a whole range of different moduli of convexity and smoothness. In our
description we will often follow m, for information about other moduli and 
further references, see m We will only consider Banach spaces of dimension 
at least 2. 

Moduli of convexity and smoothness 

A natural modification of Clarkson’s modulus of convexity was proposed by 
Gurary m 



inf (l— inf lltu + (1 — Uulbr 
u,v€Su V ‘6[0,1] 


It is clear that 

^c/(e) < ^c/(e) < 2(5(7(e) 


(25) 
(cf. the proof of Lemma 2 below), and it was shown recently [7] that this relation
cannot be improved. 

The standard modulus of smoothness was proposed by Lindenstrauss 


  $\rho_U(\tau) := \sup_{u,v \in S_U} \left( \frac{\|u + \tau v\|_U + \|u - \tau v\|_U}{2} - 1 \right), \quad \tau > 0$.  (26)


Lindenstrauss also established a simple but very useful relation of conjugacy (cf. 
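1201, §12, although $\delta$ is not always convex [23]) between $\delta$ and $\rho$:

  $\rho_{U^*}(\tau) = \sup_{\epsilon \in (0,2]} \left( \frac{\epsilon \tau}{2} - \delta_U(\epsilon) \right)$;  (27)

we can see that $2\rho_{U^*}$ is the Fenchel transform of $2\delta_U$.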

The following inequality will be the basis of the proof of Theorem 1 in the
next section. Suppose a PBFS $\mathcal{F}$ satisfies the condition (3) of Theorem 1. By
(27) we obtain for the dual space $\mathcal{F}^*$ to $\mathcal{F}$, assuming $\tau \in (0,1]$:

  $\rho_{\mathcal{F}^*}(\tau) \le \sup_{\epsilon \in (0,2]} \left( \frac{\epsilon \tau}{2} - \frac{(\epsilon/2)^p}{p} \right) = \frac{\tau^q}{q}$,  (28)

where $q := p/(p-1)$ (the supremum in (28) is attained at $\epsilon = 2\tau^{1/(p-1)}$).
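The closed form of the supremum can be confirmed numerically. The sketch below assumes the expression $\sup_{\epsilon \in (0,2]} (\epsilon\tau/2 - (\epsilon/2)^p/p)$ with value $\tau^q/q$, $q = p/(p-1)$, and simply maximizes over a fine grid of $\epsilon$; the function names and grid size are choices of this illustration.

```python
def supremum_on_grid(tau: float, p: float, grid: int = 20000) -> float:
    """Maximize eps * tau / 2 - (eps / 2)**p / p over a grid of eps in (0, 2]."""
    best = float("-inf")
    for k in range(1, grid + 1):
        eps = 2.0 * k / grid
        best = max(best, eps * tau / 2 - (eps / 2) ** p / p)
    return best


def closed_form(tau: float, p: float) -> float:
    """The claimed value tau**q / q with q = p / (p - 1)."""
    q = p / (p - 1)
    return tau ** q / q


for tau, p in [(0.5, 3.0), (0.3, 2.0), (1.0, 4.0)]:
    print(tau, p, supremum_on_grid(tau, p), closed_form(tau, p))
```

Setting the derivative in $\epsilon$ to zero gives $(\epsilon/2)^{p-1} = \tau$, which lies in $(0,2]$ exactly when $\tau \le 1$; this is where the restriction $\tau \in (0,1]$ is used.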

The Banach space $U$ is called uniformly convex if $\delta_U(\epsilon) > 0$ for all $\epsilon \in (0,2]$,
and it is called uniformly smooth if $\rho_U(\tau)/\tau \to 0$ as $\tau \to 0$. All uniformly convex
and all uniformly smooth Banach spaces $U$ are reflexive (i.e., $U^{**} = U$; see, e.g.,
[28], Proposition 1.e.3 on p. 61).

If $H$ is a Hilbert space, the “parallelogram identity”

  $\|u + v\|_H^2 + \|u - v\|_H^2 = 2\|u\|_H^2 + 2\|v\|_H^2$  (29)

immediately gives

  $\delta_H(\epsilon) = 1 - \sqrt{1 - (\epsilon/2)^2} \ge \epsilon^2/8$

and

  $\rho_H(\tau) = \sqrt{1 + \tau^2} - 1 \le \tau^2/2$.  (30)
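The two Hilbert-space formulas and their quadratic bounds can be checked directly. The sketch assumes the closed forms $\delta_H(\epsilon) = 1 - \sqrt{1 - (\epsilon/2)^2}$ and $\rho_H(\tau) = \sqrt{1 + \tau^2} - 1$; it also confirms the first formula on one explicit pair of unit vectors in the Euclidean plane. The function names are hypothetical.

```python
import math


def delta_hilbert(eps: float) -> float:
    """Modulus of convexity of a Hilbert space."""
    return 1 - math.sqrt(1 - (eps / 2) ** 2)


def rho_hilbert(tau: float) -> float:
    """Modulus of smoothness of a Hilbert space."""
    return math.sqrt(1 + tau ** 2) - 1


def midpoint_deficit(eps: float) -> float:
    """1 - |(u + v)/2| for unit vectors u, v in the plane with |u - v| = eps."""
    half_angle = math.asin(eps / 2)
    u = (math.cos(half_angle), math.sin(half_angle))
    v = (math.cos(half_angle), -math.sin(half_angle))
    mid = ((u[0] + v[0]) / 2, (u[1] + v[1]) / 2)
    return 1 - math.hypot(mid[0], mid[1])


for eps in (0.1, 0.5, 1.0, 2.0):
    print(eps, delta_hilbert(eps), eps ** 2 / 8, midpoint_deficit(eps))
for tau in (0.1, 0.5, 1.0):
    print(tau, rho_hilbert(tau), tau ** 2 / 2)
```

In a Hilbert space the midpoint deficit depends only on $\|u - v\|$, so a single pair of vectors already attains the infimum in the definition of the modulus of convexity.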

Nördlander m proved that the unit balls in Hilbert spaces are most convex
and smooth: if $U$ is a Banach space and $H$ is a Hilbert space,

  $\delta_U(\epsilon) \le \delta_H(\epsilon) = 1 - \sqrt{1 - (\epsilon/2)^2}$,
  $\rho_U(\tau) \ge \rho_H(\tau) = \sqrt{1 + \tau^2} - 1$.  (31)

The original definitions o and of the moduli of convexity and smooth¬ 
ness look very different, and Banas [S] proposed a definition of modulus of 
smoothness similar to CJ: 



sup I 1 

u.vGSu V 
\W-'"\\u=r 



tG (0,2). 


(32) 
Figure 3: Relation between p and pb


The difference p]j{e) — 5u{f) measures the degree to which (the unit ball in) U 
is deformed [H] (it is always zero for Hilbert spaces). What we will need in this 
paper is the modification of in the direction of (ESI : 

PuiT)-= sup sup (1-||tM+(l-t)?;||j^), re (0,2). (33) 

u.vGSu t€[0,l] 


Since the standard results about moduli of convexity and smoothness are 
about the definitions o and we first need to establish connections between 
(EEl and P|l . The first of these results appears in (but we still prove it since 
IS] is less easily accessible than most other papers in our bibliography). 

Lemma 1 ([S]) For all r G (0,2), 


Pl/jT) 

1 - Pui'^) 


< pu 



(34) 


Proof Let c < p\j{t) be such that, for some u,v G Su satisfying ||m — r'||[/ = t, 

u + v 


= 1 — c 


(it is clear that c can be chosen as close to pIj{t) as we wish). Set 


1 u + V 
1 - c 2 ’ 


1 T 


u - u 


1 - c2 


(cf. Figure El where OA = u, OB = v, OE = (u + u)/2, OF = u', and 
FD = t'v'). Since u',v' G Su, we have 


Puir') > 


\u' +t'v'\\,j + \\u' -t'v' 


- 1 = 


1 — c 


- 1 , 
which can be rewritten as


Letting c ^ p\jij) completes the proof (the modulus of smoothness is continuous 
by, e.g., EH, Proposition l.e.5 on p. 64). I 

Corollary 3 For all r G (0,1], 


Puir) < Pu{t). (35) 

Proof Let r G (0,1]. Following El, proof of Lemma 1, we obtain 

t ^ 2 \\u\\jj — ||m + v\\ij 

PuiV = sup - ^ 

u,v£Su ^ 

/ „ I|W + ^^||[7 + ||U-'(^||[7- ||M + ^^||^7 _ r ^ 1 

— 9 ~ 9 — 9' 

u,vGSu ^ Z Z 

We can now easily deduce (TO from and the fact that pjj is a non-decreasing 
function (E2I, Proposition l.e.5): 


Pu{t) < 


Puij) 

l-Pf/M 


< Pu 



< Pu{t). 


Lemma 2 For all r G (0, 2), 


PuiT) < ‘^Pui'r). 

Proof Suppose p\j{t) > c. Let u,v G Sjj and t G [0, 1] be such that ||m — '^Wu = 
T and 

||tu + (1 — t)v\\jj < 1 — c. 

Without loss of generality we assume t < 1/2. Since 


U + V 

2 

u 

1 

- 2t 


1 - 2t 


2-2t 


u + 


1 


2-2t 


{tu -I- (1 — t)v) 


< 


1 


2-2t 


2-2t 


\\tu + (1 - t)v\ 


u 

^l-2t 
^ ^ 2-2t 


1 


2-2t 


(1-c) 


‘2-2t-c ^ 2^ = i_ £ 
2-2t - 2 2’ 


we have Pi/{t) > c/2. 
Direct sums of uniformly smooth spaces

If $U_1$ and $U_2$ are two Banach spaces, their weighted direct sum $U_1 \oplus U_2$ is
defined to be the Cartesian product $U_1 \times U_2$ with the operations of addition
and multiplication by scalar defined by

  $(u_1, u_2) + (u_1', u_2') := (u_1 + u_1', u_2 + u_2'), \quad c(u_1, u_2) := (cu_1, cu_2)$;

we will equip it with the norm

  $\|(u_1, u_2)\|_{U_1 \oplus U_2} := \sqrt{a_1 \|u_1\|_{U_1}^2 + a_2 \|u_2\|_{U_2}^2}$,  (36)

where $a_1$ and $a_2$ are positive constants (to simplify formulas, we do not
mention them explicitly in our notation for $U_1 \oplus U_2$). The operation of weighted
direct sum provides a means of merging different Banach spaces, which plays an
important role in our proof technique (cf. Corollary 4). The “Euclidean”
definition of the norm in the direct sum suggests that the sum will be as
smooth as the components; this intuition is formalized in the following lemma
(essentially a special case of Proposition 17 in ^21) P- 132). 

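Lemma 3 If $U_1$ and $U_2$ are Banach spaces and $f : (0,1] \to \mathbb{R}$,

  $\left( \forall \tau \in (0,1]: \ \rho_{U_1}(\tau) \le f(\tau) \ \& \ \rho_{U_2}(\tau) \le f(\tau) \right)$

  $\Longrightarrow \left( \forall \tau \in (0,1]: \ \rho_{U_1 \oplus U_2}(\tau) \le 4.34 f(\tau) \right)$.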

Proof We will follow the proof of Proposition 17 in ca, which is based on 
the following weak form of the parallelogram identity (29), valid for all Banach
spaces:

  $\|u + v\|_U^2 + \|u - v\|_U^2 - 2\|u\|_U^2 - 2\|v\|_U^2$

  $\le 2\|u\|_U \left( \|u + v\|_U + \|u - v\|_U - 2\|u\|_U \right)$  (37)

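(see na, Lemma 16 on p. 132); it is clear that (37) implies

  $\|u + v\|_U^2 + \|u - v\|_U^2 - 2\|u\|_U^2 - 2\|v\|_U^2 \le 4 \|u\|_U^2 \, \rho_U \left( \|v\|_U / \|u\|_U \right)$.  (38)

Let $u = (u_1, u_2)$ and $v = (v_1, v_2)$ be arbitrary norm one vectors in $U_1 \oplus U_2$.
Applying (38) to $(u, v) := (u_1, \tau v_1)$ and $(u, v) := (u_2, \tau v_2)$, we obtain

  $\|u_1 + \tau v_1\|_{U_1}^2 + \|u_1 - \tau v_1\|_{U_1}^2 - 2\|u_1\|_{U_1}^2 - 2\tau^2 \|v_1\|_{U_1}^2$

  $\le 4 \|u_1\|_{U_1}^2 \, \rho_{U_1} \left( \tau \|v_1\|_{U_1} / \|u_1\|_{U_1} \right)$  (39)

and

  $\|u_2 + \tau v_2\|_{U_2}^2 + \|u_2 - \tau v_2\|_{U_2}^2 - 2\|u_2\|_{U_2}^2 - 2\tau^2 \|v_2\|_{U_2}^2$

  $\le 4 \|u_2\|_{U_2}^2 \, \rho_{U_2} \left( \tau \|v_2\|_{U_2} / \|u_2\|_{U_2} \right)$.  (40)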
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Multiplying by oi and (gOJ) by 02 and summing now gives 


Im'^ + I 


Ui®Ui 


+ kt' — TV 


tir 

c/iec/2 

2 

i=i 


- 2 - 2r^ 


/ \\Ui 




)■ (41) 


To estimate the sum over $j = 1,2$, notice that:

• when $\|v_j\|_{U_j} \le \|u_j\|_{U_j}$,

\[ \rho_{U_j}\bigl(\tau\|v_j\|_{U_j} / \|u_j\|_{U_j}\bigr) \le \rho_{U_j}(\tau)\, \|v_j\|_{U_j} / \|u_j\|_{U_j} \]

(by the convexity of $\rho$, following from the convexity of the Fenchel transform,
(??), and the reflexivity of all uniformly convex and all uniformly
smooth spaces);

• when $\|v_j\|_{U_j} > \|u_j\|_{U_j}$,

\[ \rho_{U_j}\bigl(\tau\|v_j\|_{U_j} / \|u_j\|_{U_j}\bigr) \le L \rho_{U_j}(\tau) \bigl(\|v_j\|_{U_j} / \|u_j\|_{U_j}\bigr)^2 \]

(where $L < 3.18$ is a constant satisfying $\rho(\sigma)/\sigma^2 \le L\rho(\tau)/\tau^2$ for all
positive $\tau \le \sigma$; see [21], Proposition 10 on p. 128 and the remark after its
proof).

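Using the Cauchy-Schwarz inequality, the sum can be bounded above as follows:

\[
\sum_{j=1}^{2} a_j \|u_j\|_{U_j}^2\, \rho_{U_j}\bigl(\tau\|v_j\|_{U_j} / \|u_j\|_{U_j}\bigr)
\le \sum_{j=1}^{2} a_j\, \rho_{U_j}(\tau) \max\bigl(\|u_j\|_{U_j}\|v_j\|_{U_j},\; L\|v_j\|_{U_j}^2\bigr)
\]
\[
\le f(\tau) \sqrt{\sum_{j=1}^{2} a_j \|v_j\|_{U_j}^2}\;
\sqrt{\sum_{j=1}^{2} a_j \bigl(\|u_j\|_{U_j}^2 + L^2\|v_j\|_{U_j}^2\bigr)}
= \sqrt{L^2+1}\, f(\tau) \tag{42}
\]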

(the last line assuming $\tau \in (0,1]$). Now we have all we need to deduce the
conclusion of the lemma (some steps will be explained after the equation): when
$\tau \in (0,1]$,

\[
\frac12\bigl(\|u+\tau v\|_{U_1\oplus U_2} + \|u-\tau v\|_{U_1\oplus U_2}\bigr)
\le \Bigl(\frac12\bigl(\|u+\tau v\|_{U_1\oplus U_2}^2 + \|u-\tau v\|_{U_1\oplus U_2}^2\bigr)\Bigr)^{1/2}
\]
\[
\le \bigl(1 + \tau^2 + 2\sqrt{L^2+1}\, f(\tau)\bigr)^{1/2}
\le (1+\tau^2)^{1/2} + \sqrt{L^2+1}\, f(\tau)
\]
\[
\le 1 + f(\tau) + \sqrt{L^2+1}\, f(\tau) = 1 + \bigl(1 + \sqrt{L^2+1}\bigr) f(\tau)
\]

(the first inequality follows from the convexity of the function $t \mapsto t^2$, the second
from (41) and (42), the third from the mean-value theorem, and the fourth from
Nördlander’s bound (??)). It remains to compare the resulting inequality with
the definition of the modulus of smoothness and remember that $L < 3.18$. ∎
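
The arithmetic behind the constant 4.34 can be checked with a short Python sketch. It takes the value 3.18 as the upper bound on $L$ quoted in the proof, evaluates $1 + \sqrt{L^2+1}$, and spot-checks the mean-value step $\sqrt{A+B} \le \sqrt{A} + B/2$ for $A \ge 1$, $B \ge 0$ on a few sample values. The function name `mean_value_step_holds` and the sample grid are illustrative assumptions, not part of the original argument.

```python
import math

L = 3.18  # upper bound on the constant L quoted in the proof

# Constant multiplying f(tau) at the end of the chain: 1 + sqrt(L^2 + 1).
constant = 1.0 + math.sqrt(L ** 2 + 1.0)

def mean_value_step_holds(A, B):
    """Check sqrt(A + B) <= sqrt(A) + B / 2, valid whenever A >= 1 and B >= 0."""
    return math.sqrt(A + B) <= math.sqrt(A) + B / 2.0 + 1e-12

# Third inequality of the chain with A = 1 + tau^2 and B = 2 * sqrt(L^2 + 1) * f.
checks = [mean_value_step_holds(1.0 + tau ** 2, 2.0 * math.sqrt(L ** 2 + 1.0) * f)
          for tau in (0.1, 0.5, 1.0) for f in (0.0, 0.01, 0.3, 2.0)]
print(round(constant, 4), all(checks))
```

The computed constant is slightly below 4.34, which is consistent with the rounding used in the statement of Lemma 3.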

Convexity and smoothness for Sobolev spaces 

It was shown by Clarkson [?] (§3) that, for $p \in [2, \infty)$,

\[ \delta_{L^p}(\epsilon) \ge 1 - \bigl(1 - (\epsilon/2)^p\bigr)^{1/p}. \]

(And this bound was shown to be optimal in [?].) A quick inspection of the
standard proofs (see, e.g., [2], 2.34-2.40) shows that the underlying measurable
space $\Omega$ and measure $\mu$ of $L^p = L^p(\Omega,\mu)$ can be essentially arbitrary (only the
degenerate case where $\dim L^p < 2$ should be excluded), although this generality
is usually not emphasized.

It is easy to see (cf. [2], 3.5-3.6) that the modulus of convexity of each
Sobolev space $W^{s,p}(X)$, $s \in (0,1)$ and $p \in [2, \infty)$, also satisfies

\[ \delta_{W^{s,p}(X)}(\epsilon) \ge 1 - \bigl(1 - (\epsilon/2)^p\bigr)^{1/p}. \tag{43} \]

Indeed, with each $f \in W^{s,p}(X)$ we can associate a function $\bar f : X \cup X^2 \to \mathbb{R}$
(we regard the sets $X$ and $X^2$ as disjoint) such that

\[ \bar f(x) = f(x) \quad \text{for } x \in X, \]

\[ \bar f(x,y) = \frac{f(x) - f(y)}{|x-y|^{s}} \quad \text{for } (x,y) \in X^2; \]

the measure on X U X^ coincides with the Lebesgue measure on the measurable 
subsets of X and with the measure whose density is {x,y) G X^ i—> |x — 
with respect to the Lebesgue measure, on the measurable subsets of X^. The 
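bound (43) can now be deduced from Clarkson’s result as follows:

\[
\delta_{W^{s,p}(X)}(\epsilon)
:= \inf_{\substack{f,g \in S_{W^{s,p}(X)} \\ \|f-g\|_{W^{s,p}(X)} = \epsilon}}
\left(1 - \left\|\frac{f+g}{2}\right\|_{W^{s,p}(X)}\right)
= \inf_{\substack{f,g : X \to \mathbb{R} \\ \bar f, \bar g \in S_{L^p(X\cup X^2)} \\ \|\bar f-\bar g\|_{L^p(X\cup X^2)} = \epsilon}}
\left(1 - \left\|\frac{\bar f+\bar g}{2}\right\|_{L^p(X\cup X^2)}\right)
\]
\[
\ge \inf_{\substack{u,v \in S_{L^p(X\cup X^2)} \\ \|u-v\|_{L^p(X\cup X^2)} = \epsilon}}
\left(1 - \left\|\frac{u+v}{2}\right\|_{L^p(X\cup X^2)}\right)
= \delta_{L^p(X\cup X^2)}(\epsilon) \ge 1 - \bigl(1 - (\epsilon/2)^p\bigr)^{1/p}.
\]
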


Since, for $t \in [0,1]$ and $p \ge 1$, $(1-t)^{1/p} \le 1 - t/p$ (the left-hand side is
a concave function of $t$, and the values and derivatives of the two sides match
when $t = 0$), we have

\[ \delta_{W^{s,p}(X)}(\epsilon) \ge (\epsilon/2)^p / p. \tag{44} \]

Therefore, as we said earlier, the Sobolev spaces indeed satisfy the condition
of Theorem 1.
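
The passage from the Clarkson-type bound (43) to the simpler bound (44) can be checked numerically. The following Python sketch is an illustration only: the function names `clarkson_bound` and `simplified_bound` and the evaluation grid are assumptions made here. It verifies on the grid that $1 - (1-(\epsilon/2)^p)^{1/p} \ge (\epsilon/2)^p/p$ for $\epsilon \in (0,2]$ and several values of $p \ge 2$.

```python
def clarkson_bound(eps, p):
    """Lower bound 1 - (1 - (eps/2)^p)^(1/p) on the modulus of convexity, as in (43)."""
    return 1.0 - (1.0 - (eps / 2.0) ** p) ** (1.0 / p)

def simplified_bound(eps, p):
    """Simplified lower bound (eps/2)^p / p, as in (44)."""
    return (eps / 2.0) ** p / p

# Grid of eps in (0, 2] and a few exponents p >= 2 (illustrative choice).
grid = [(k / 50.0 * 2.0, p) for k in range(1, 51) for p in (2.0, 2.5, 4.0, 10.0)]

# A small tolerance absorbs floating-point rounding for tiny (eps/2)^p.
all_ok = all(clarkson_bound(e, p) >= simplified_bound(e, p) - 1e-12 for e, p in grid)
print(all_ok)
```

For $p = 2$ the first function reduces to $1 - \sqrt{1 - \epsilon^2/4}$, the Hilbert-space modulus of convexity.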


5 Proof of Theorem 1

In this section we partly follow the proof of Theorem 1 in m (§6). 


The BBK29 algorithm 

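Let $U$ be a Banach space. We say that a function $\Phi : [-Y,Y] \times X \to U$
is forecast-continuous if $\Phi(\mu,x)$ is continuous in $\mu \in [-Y,Y]$ for every fixed
$x \in X$. For such a $\Phi$ the function

\[
f_n(y,\mu) := \left\| \sum_{i=1}^{n-1} (y_i-\mu_i)\Phi(\mu_i,x_i) + (y-\mu)\Phi(\mu,x_n) \right\|_U
- \left\| \sum_{i=1}^{n-1} (y_i-\mu_i)\Phi(\mu_i,x_i) \right\|_U \tag{45}
\]

is continuous in $\mu \in [-Y,Y]$.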

Banach-space Balanced K29 algorithm (BBK29)

Parameter: forecast-continuous $\Phi : [-Y,Y] \times X \to U$, with $U$ a Banach space
FOR $n = 1,2,\ldots$:

Read $x_n \in X$.
Define $f_n : [-Y,Y]^2 \to \mathbb{R}$ by (45).

Output any root $\mu \in [-Y,Y]$ of $f_n(-Y,\mu) = f_n(Y,\mu)$ as $\mu_n$;
if there are no such roots, output $\mu_n \in \{-Y,Y\}$
such that $\sup_{y \in [-Y,Y]} f_n(y,\mu_n) \le 0$.

Read $y_n \in [-Y,Y]$.

END FOR.
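
The protocol above can be sketched in Python for a finite-dimensional stand-in for the Banach space $U$. All concrete choices below are assumptions for illustration: $U$ is taken to be $\mathbb{R}^3$ with the $\ell_{1.5}$ norm, the feature mapping `phi` is hypothetical, and the root is located by bisection to a tolerance rather than exactly. The sketch also checks the two endpoints before bisecting: when $f_n(-Y,Y) \le 0$ it outputs $Y$, and when $f_n(Y,-Y) \le 0$ it outputs $-Y$. In both cases $\sup_y f_n(y,\mu_n) \le 0$ by convexity in $y$, which is the property the no-root branch of the algorithm asks for, but this is a simplification and not a literal transcription of "output any root".

```python
import math

def lr_norm(vec, r):
    """l_r norm on R^d, used here as a stand-in Banach-space norm."""
    return sum(abs(c) ** r for c in vec) ** (1.0 / r)

def bbk29_predict(S, x, phi, norm, Y, tol=1e-10):
    """One prediction; S is the running sum of (y_i - mu_i) * phi(mu_i, x_i)."""
    def gap(mu):
        # f_n(Y, mu) - f_n(-Y, mu); the common term ||S|| in (45) cancels.
        feat = phi(mu, x)
        up = [s + (Y - mu) * c for s, c in zip(S, feat)]
        down = [s + (-Y - mu) * c for s, c in zip(S, feat)]
        return norm(up) - norm(down)

    if gap(Y) >= 0.0:      # f_n(-Y, Y) <= f_n(Y, Y) = 0
        return Y
    if gap(-Y) <= 0.0:     # f_n(Y, -Y) <= f_n(-Y, -Y) = 0
        return -Y
    lo, hi = -Y, Y         # gap(lo) > 0 > gap(hi): bisect for an approximate root
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if gap(mid) > 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def run_bbk29(xs, ys, phi, norm, Y, dim):
    """Run the protocol on a finite sequence; returns predictions and the final ||S||."""
    S = [0.0] * dim
    preds = []
    for x, y in zip(xs, ys):
        mu = bbk29_predict(S, x, phi, norm, Y)
        preds.append(mu)
        feat = phi(mu, x)
        S = [s + (y - mu) * c for s, c in zip(S, feat)]
    return preds, norm(S)

# Illustrative setup (all choices hypothetical): Y = 1, features in (R^3, l_1.5).
Y = 1.0
phi = lambda mu, x: [mu / Y, 1.0, x]
norm = lambda v: lr_norm(v, 1.5)
xs = [i / 10.0 for i in range(10)]
ys = [math.sin(3.0 * x) for x in xs]
preds, final_norm = run_bbk29(xs, ys, phi, norm, Y, dim=3)
print(preds[:3], final_norm)
```

With an empty history the two norms in `gap` coincide at $\mu = 0$, so the first prediction is (up to the bisection tolerance) zero.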


The validity of this description depends on the existence of $\mu \in \{-Y,Y\}$
satisfying $\sup_{y \in [-Y,Y]} f_n(y,\mu) \le 0$ when the equation $f_n(-Y,\mu) = f_n(Y,\mu)$ does
not have roots $\mu \in [-Y,Y]$. The existence of such a $\mu$ is easy to check: if
$f_n(-Y,\mu) < f_n(Y,\mu)$ for all $\mu \in [-Y,Y]$, take $\mu := Y$ to obtain

\[ f_n(-Y,\mu) < f_n(Y,\mu) = 0 \]

and, hence, $\sup_{y \in [-Y,Y]} f_n(y,\mu) \le 0$ by the convexity of (45) in $y$; if $f_n(-Y,\mu) >
f_n(Y,\mu)$ for all $\mu \in [-Y,Y]$, setting $\mu := -Y$ leads to

\[ f_n(Y,\mu) < f_n(-Y,\mu) = 0 \]

and, hence, $\sup_{y \in [-Y,Y]} f_n(y,\mu) \le 0$. The parameter $\Phi$ of the BBK29 algorithm
will sometimes be called the feature mapping.

Theorem 2 Let $\Phi$ be a forecast-continuous mapping from $[-Y,Y] \times X$ to a
Banach space $U$ and set $c_\Phi := \sup_{\mu \in [-Y,Y],\, x \in X} \|\Phi(\mu,x)\|_U$. Suppose
$\rho_U(\tau) \le a\tau^q$, $\forall \tau \in (0,1]$, for some constants $q > 1$ and $a \ge 1/q$. The BBK29 algorithm
with parameter $\Phi$ outputs $\mu_n \in [-Y,Y]$ such that

\[
\left\| \sum_{n=1}^{N} (y_n-\mu_n)\Phi(\mu_n,x_n) \right\|_U \le 2Yc_\Phi (2aqN)^{1/q} \tag{46}
\]

always holds for all $N = 1,2,\ldots$.


Proof Set

\[ S_N := \left\| \sum_{n=1}^{N} (y_n-\mu_n)\Phi(\mu_n,x_n) \right\|_U; \]

our goal is to prove

\[ S_N \le 2Yc_\Phi (2aqN)^{1/q}. \]

For $N = 1$, this follows from

\[ 2Yc_\Phi \le 2Yc_\Phi (2aqN)^{1/q}, \]

which in turn follows from $2aq \ge 1$, which in turn follows from the condition
$a \ge 1/q$. It remains to prove that

\[ S_{N-1} \le 2Yc_\Phi \bigl(2aq(N-1)\bigr)^{1/q} \]

implies

\[ S_N \le 2Yc_\Phi (2aqN)^{1/q} \tag{47} \]


for N >2. Without loss of generality we assume that = fN{Y, Mw) 

and replace S'at in 631 by /at := fN{Y,fJ,N)- 
Fix $N \ge 2$. We will assume that

\[ S_{N-1} \le 2Yc_\Phi \bigl(2aq(N-1)\bigr)^{1/q} \quad \& \quad f_N > 2Yc_\Phi (2aqN)^{1/q} \tag{48} \]


and arrive at a contradiction. By the definition of p^, 


Sn-1 > In Pu 


2Y ||$(^Ar, a^Af)!! 

In 


(cf. Figure 3). Since $f_N > 2Yc_\Phi$ (remember that we are assuming (48)), by
Corollary 3 and Lemma 2 this implies

\[
S_{N-1} \ge f_N \left( 1 - 2a \left( \frac{2Y\|\Phi(\mu_N,x_N)\|_U}{f_N} \right)^{q} \right).
\]

As the right-hand side is a monotonically increasing function of $f_N$ (which can
be checked by differentiation), in combination with (48) the last inequality gives

\[
2Yc_\Phi \bigl(2aq(N-1)\bigr)^{1/q} \ge 2Yc_\Phi (2aqN)^{1/q} \Bigl( 1 - 2a \bigl((2aqN)^{-1/q}\bigr)^{q} \Bigr),
\]

i.e.,

\[ (N-1)^{1/q} \ge N^{1/q} \left( 1 - \frac{1}{qN} \right). \]

It remains to rewrite the last inequality as

\[ N^{1/q} - (N-1)^{1/q} \le \frac{1}{q} N^{1/q-1} \tag{49} \]

and notice that, by the mean-value theorem, the left-hand side of (49) equals

\[ \frac{1}{q} (N-\theta)^{1/q-1} \]

for some $\theta \in (0,1)$: as $1/q - 1 < 0$, we have the required contradiction. ∎
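
The final step of the proof can be illustrated numerically. The Python sketch below (function names `theorem2_bound`, `increment` and `threshold` are illustrative assumptions) evaluates the right-hand side of (46) and confirms, on a finite range of $N$ and a few values of $q > 1$, that the inequality (49) never holds, which is exactly the contradiction the argument relies on.

```python
def theorem2_bound(Y, c_phi, a, q, N):
    """Right-hand side of (46): 2 * Y * c_phi * (2 * a * q * N)^(1/q)."""
    return 2.0 * Y * c_phi * (2.0 * a * q * N) ** (1.0 / q)

def increment(N, q):
    """Left-hand side of (49): N^(1/q) - (N - 1)^(1/q)."""
    return N ** (1.0 / q) - (N - 1) ** (1.0 / q)

def threshold(N, q):
    """Right-hand side of (49): (1/q) * N^(1/q - 1)."""
    return N ** (1.0 / q - 1.0) / q

# Collect every (N, q) on the grid for which (49) would hold; the proof says there are none.
violations = [(N, q) for N in range(2, 200) for q in (1.1, 1.5, 2.0)
              if increment(N, q) <= threshold(N, q)]
print(violations)
```

A finite grid is of course no substitute for the mean-value argument; it only makes the direction of the inequality concrete.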


The feature mapping for the proof of Theorem 1

In the proof of Theorem 1 we will need two feature mappings from $[-Y,Y] \times X$
to different Banach spaces: first, $\Phi_1(\mu,x) := \mu$ (mapping to the Banach space
$\mathbb{R}$), and second, $\Phi_2 : [-Y,Y] \times X \to \mathcal{F}^*$ such that $\Phi_2(\mu,x)$ is the evaluation
functional $k_x : f \mapsto f(x)$, $f \in \mathcal{F}$. We combine them into one feature mapping

\[ \Phi(\mu,x) := \bigl(\Phi_1(\mu,x), \Phi_2(\mu,x)\bigr) \tag{50} \]

to the weighted direct sum $U := \mathbb{R} \oplus \mathcal{F}^*$, with the weights $a_1$ and $a_2$ to be chosen
later. By Lemma 3, (??), and (??), $\rho_U(\tau) \le a\tau^q$, where $a := 4.34/q$. With the
help of Theorem 2 we obtain for the BBK29 algorithm with parameter $\Phi$:


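\[
\left| \sum_{n=1}^{N} (y_n-\mu_n)\mu_n \right|
= \left| \sum_{n=1}^{N} (y_n-\mu_n)\Phi_1(\mu_n,x_n) \right|
\le \frac{1}{\sqrt{a_1}} \left\| \sum_{n=1}^{N} (y_n-\mu_n)\Phi(\mu_n,x_n) \right\|_U
\le \frac{1}{\sqrt{a_1}}\, 2Yc_\Phi (2aqN)^{1/q} \tag{51}
\]

and

\[
\left| \sum_{n=1}^{N} (y_n-\mu_n)D(x_n) \right|
= \left| \sum_{n=1}^{N} (y_n-\mu_n)k_{x_n}(D) \right|
= \left| \left( \sum_{n=1}^{N} (y_n-\mu_n)k_{x_n} \right)(D) \right|
\le \left\| \sum_{n=1}^{N} (y_n-\mu_n)k_{x_n} \right\|_{\mathcal{F}^*} \|D\|_{\mathcal{F}}
\]
\[
= \left\| \sum_{n=1}^{N} (y_n-\mu_n)\Phi_2(\mu_n,x_n) \right\|_{\mathcal{F}^*} \|D\|_{\mathcal{F}}
\le \frac{1}{\sqrt{a_2}} \left\| \sum_{n=1}^{N} (y_n-\mu_n)\Phi(\mu_n,x_n) \right\|_U \|D\|_{\mathcal{F}}
\le \frac{1}{\sqrt{a_2}}\, 2Yc_\Phi (2aqN)^{1/q} \|D\|_{\mathcal{F}} \tag{52}
\]

for each function $D \in \mathcal{F}$.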


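Proof proper

The proof is based on the inequality

\[
\sum_{n=1}^{N} (y_n-\mu_n)^2
= \sum_{n=1}^{N} \bigl(y_n - D(x_n)\bigr)^2 + 2\sum_{n=1}^{N} \bigl(D(x_n)-\mu_n\bigr)(y_n-\mu_n) - \sum_{n=1}^{N} \bigl(D(x_n)-\mu_n\bigr)^2
\]
\[
\le \sum_{n=1}^{N} \bigl(y_n - D(x_n)\bigr)^2 + 2\sum_{n=1}^{N} \bigl(D(x_n)-\mu_n\bigr)(y_n-\mu_n)
\]

(immediately following from (20)). Using this inequality and (51) and (52) with
$a_1 := Y^{-2}$ and $a_2 := 1$, we obtain for the $\mu_n \in [-Y,Y]$ output by the BBK29
algorithm with $\Phi$ as parameter:

\[
\sum_{n=1}^{N} (y_n-\mu_n)^2
\le \sum_{n=1}^{N} \bigl(y_n - D(x_n)\bigr)^2
+ 2\left| \sum_{n=1}^{N} D(x_n)(y_n-\mu_n) \right|
+ 2\left| \sum_{n=1}^{N} \mu_n(y_n-\mu_n) \right|
\]
\[
\le \sum_{n=1}^{N} \bigl(y_n - D(x_n)\bigr)^2 + 4Yc_\Phi (2aqN)^{1/q} \bigl(\|D\|_{\mathcal{F}} + Y\bigr).
\]

Since

\[ c_\Phi \le \sqrt{a_1 Y^2 + a_2 c_{\mathcal{F}}^2} = \sqrt{c_{\mathcal{F}}^2 + 1}, \]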

we can see that the bound of Theorem 1 holds with

\[ 4(2aq)^{1/q} = 4 \times 8.68^{1/q} \tag{53} \]

in place of 40.
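
Two elementary ingredients of the proof proper lend themselves to a quick numerical check, sketched below in Python. The first function verifies the algebraic identity behind the displayed inequality on arbitrary data; the second evaluates the constant (53) with $a = 4.34/q$. The names `decomposition_gap` and `leading_constant`, the random test data, and the restriction to $q \in (1,2]$ are assumptions made for this illustration.

```python
import random

def decomposition_gap(ys, mus, ds):
    """sum (y-mu)^2 minus [sum (y-D)^2 + 2 sum (D-mu)(y-mu) - sum (D-mu)^2]."""
    lhs = sum((y - m) ** 2 for y, m in zip(ys, mus))
    rhs = (sum((y - d) ** 2 for y, d in zip(ys, ds))
           + 2.0 * sum((d - m) * (y - m) for y, m, d in zip(ys, mus, ds))
           - sum((d - m) ** 2 for m, d in zip(mus, ds)))
    return lhs - rhs

def leading_constant(q):
    """The constant (53): 4 * (2 a q)^(1/q) with a = 4.34 / q, i.e. 4 * 8.68^(1/q)."""
    a = 4.34 / q
    return 4.0 * (2.0 * a * q) ** (1.0 / q)

rng = random.Random(0)  # fixed seed; the data are arbitrary and purely illustrative
ys = [rng.uniform(-1, 1) for _ in range(20)]    # labels y_n
mus = [rng.uniform(-1, 1) for _ in range(20)]   # predictions mu_n
ds = [rng.uniform(-2, 2) for _ in range(20)]    # values D(x_n) of a benchmark rule

gap = decomposition_gap(ys, mus, ds)            # zero up to rounding
constants = [leading_constant(q) for q in (1.01, 1.25, 1.5, 2.0)]
print(abs(gap) < 1e-9, max(constants))
```

On this range of $q$ the constant stays below $4 \times 8.68 = 34.72$, hence below 40.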

6 Banach kernels 

An RKHS can be defined as a PBFS in which the norm is expressed via an inner
product as $\|f\| = \sqrt{\langle f, f \rangle}$. It is well known that all information about an RKHS
$\mathcal{F}$ on $Z$ is contained in its “reproducing kernel”, which is a symmetric positive
definite function on ( 0 , §§1.1-12). The reproducing kernel can be regarded 
as the constructive representation of its RKHS, and it is the reproducing kernel 
rather than the RKHS itself that serves as a parameter of various machine-learning
algorithms. In this section we will introduce a similar constructive
representation for PBFS. 

A Banach kernel $B$ on a set $Z$ is a function that maps each finite non-empty
sequence $z_1,\ldots,z_n$ of distinct elements of $Z$ to a seminorm $\|\cdot\|_{B(z_1,\ldots,z_n)}$ on
$\mathbb{R}^n$ and satisfies the following conditions (familiar from Kolmogorov’s existence
theorem [?], §III.4):

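• for each $n = 1,2,\ldots$, each sequence $z_1,\ldots,z_n$ of distinct elements of $Z$,
each sequence $(t_1,\ldots,t_n) \in \mathbb{R}^n$, and each permutation $\begin{pmatrix} 1 & \cdots & n \\ i_1 & \cdots & i_n \end{pmatrix}$,

\[ \|(t_{i_1},\ldots,t_{i_n})\|_{B(z_{i_1},\ldots,z_{i_n})} = \|(t_1,\ldots,t_n)\|_{B(z_1,\ldots,z_n)}; \]

• for each $n = 1,2,\ldots$, each $k = 1,\ldots,n$, each sequence $z_1,\ldots,z_n$ of distinct
elements of $Z$, and each sequence $(t_1,\ldots,t_k) \in \mathbb{R}^k$,

\[ \|(t_1,\ldots,t_k)\|_{B(z_1,\ldots,z_k)} = \|(t_1,\ldots,t_k,0,\ldots,0)\|_{B(z_1,\ldots,z_n)}. \]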

The Banach kernel of a mapping $\Phi : Z \to U$ to a Banach space $U$ is the
Banach kernel $B$ defined by

\[ \|(t_1,\ldots,t_n)\|_{B(z_1,\ldots,z_n)} := \|t_1\Phi(z_1) + \cdots + t_n\Phi(z_n)\|_U. \]
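
To make the definition concrete, here is a minimal Python sketch of the Banach kernel of a mapping, together with a check of the two consistency conditions on one example. The mapping `phi`, the target space ($\mathbb{R}^3$ with the $\ell_{1.5}$ norm), and the helper name `banach_kernel_of_mapping` are hypothetical choices made for illustration.

```python
import itertools

def lr_norm(vec, r):
    """l_r norm on R^d, used here as a stand-in Banach-space norm."""
    return sum(abs(c) ** r for c in vec) ** (1.0 / r)

def banach_kernel_of_mapping(phi, norm, dim):
    """Return B with B(zs)(ts) = || t_1 phi(z_1) + ... + t_n phi(z_n) ||_U."""
    def B(zs):
        feats = [phi(z) for z in zs]
        def seminorm(ts):
            combo = [sum(t * f[i] for t, f in zip(ts, feats)) for i in range(dim)]
            return norm(combo)
        return seminorm
    return B

# Hypothetical mapping of Z = R into U = (R^3, l_1.5).
phi = lambda z: [1.0, z, z * z]
B = banach_kernel_of_mapping(phi, lambda v: lr_norm(v, 1.5), dim=3)

zs = (0.0, 0.5, 2.0)    # distinct elements of Z
ts = (1.0, -3.0, 0.25)  # coefficients
base = B(zs)(ts)

# First condition: invariance under a simultaneous permutation of the z's and t's.
perm_ok = all(abs(B([zs[i] for i in p])([ts[i] for i in p]) - base) < 1e-12
              for p in itertools.permutations(range(3)))
# Second condition: padding with zero coefficients does not change the seminorm.
pad_ok = abs(B(zs[:2])(ts[:2]) - B(zs)(ts[:2] + (0.0,))) < 1e-12
print(perm_ok, pad_ok)
```

Both conditions hold automatically for a kernel obtained from a mapping, since the linear combination itself is unchanged by reordering or by adding zero terms.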

Proposition 1 For each Banach kernel $B$ on $Z$ there exists a Banach space $U$
and a mapping $\Phi : Z \to U$ such that $B$ is the Banach kernel of $\Phi$.

Proposition 1 is a special case of the following Proposition 2, but we still need
to prove it as the proof of Proposition 2 depends on it.

Proof of Proposition 1 Let $U_1$ be the set of all formal linear combinations $t_1 z_1 +
\cdots + t_n z_n$, where $n \in \{0,1,2,\ldots\}$, $(t_1,\ldots,t_n) \in (\mathbb{R} \setminus \{0\})^n$, and $z_1,\ldots,z_n$
are distinct elements of $Z$. (There is only one linear combination, denoted
0, corresponding to $n = 0$.) We do not distinguish linear combinations if they
have the same addends (perhaps listed in different orders). The set $U_1$ is a linear
space with the obvious operations of addition and multiplication by scalar: in
the sum the addends that are multiples of the same $z \in Z$ should be grouped
together (and removed if the resulting coefficient is zero) and multiplication by
0 gives 0.

For each linear combination $t_1 z_1 + \cdots + t_n z_n \in U_1$, $n > 0$, its seminorm is
defined to be $\|(t_1,\ldots,t_n)\|_{B(z_1,\ldots,z_n)}$; the seminorm of $0 \in U_1$ is defined to be
0; it is easy to check that this is indeed a seminorm (it is well defined because of
the first condition in the definition of Banach kernel, and the triangle inequality
follows from the second condition). Two linear combinations are said to be
equivalent if their difference has zero seminorm (this is indeed an equivalence
relation because of the second condition). Let $U_2$ be the set of all equivalence
classes.

The norm of $u \in U_2$ can be defined as the seminorm of any element of the
equivalence class $u$. It remains to take the completion of $U_2$ as $U$ and to define
$\Phi : Z \to U$ so that $\Phi(z)$ is the equivalence class containing $1z \in U_1$. ∎

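The Banach kernel of a PBFS $\mathcal{F}$ on $Z$ is the Banach kernel $B$ defined by

\[ \|(t_1,\ldots,t_n)\|_{B(z_1,\ldots,z_n)} := \|t_1 k_{z_1} + \cdots + t_n k_{z_n}\|_{\mathcal{F}^*}, \]

where $k_z : \mathcal{F} \to \mathbb{R}$, $z \in Z$, is the evaluation functional $f \in \mathcal{F} \mapsto f(z)$.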

Proposition 2 For each Banach kernel $B$ on $Z$ there exists a proper Banach
functional space $\mathcal{F}$ on $Z$ such that $B$ is the Banach kernel of $\mathcal{F}$.

Proof Let $\Phi : Z \to U$ be a mapping to a Banach space $U$ such that $B$ is
the Banach kernel of $\Phi$ (such a $\Phi$ exists by Proposition 1). Without loss of
generality we will assume that $\Phi(Z)$ spans $U$. Define $\mathcal{F}$ to be the set of all
functions $f : Z \to \mathbb{R}$ of the form

\[ f(z) := \phi(\Phi(z)), \tag{54} \]

where $\phi$ is a continuous linear functional on $U$, $\phi \in U^*$. The norm of the
function (54) is $\|f\|_{\mathcal{F}} := \|\phi\|_{U^*}$. We will prove that $\mathcal{F}$ is a PBFS and that $B$ is
the Banach kernel of $\mathcal{F}$.

It is obvious that $\mathcal{F}$ is a linear space (under the usual pointwise operations
of addition and multiplication by scalar) and that $\|f\|_{\mathcal{F}}$ is well-defined (i.e.,
does not depend on the choice of $\phi$ satisfying (54): there is only one such $\phi$).
All defining properties of a norm are clearly satisfied for $\|\cdot\|_{\mathcal{F}}$; in particular,
$\|f\|_{\mathcal{F}} = 0$ implies $f = 0$. The completeness of $\mathcal{F}$ follows from the completeness
of $U^*$. The boundedness of the evaluation functionals for $\mathcal{F}$ means that, for
each fixed $z \in Z$,

\[ \sup_{\phi : \|\phi\|_{U^*} \le 1} |\phi(\Phi(z))| < \infty; \]

this immediately follows from the definition of $\|\cdot\|_{U^*}$. This completes the proof
that $\mathcal{F}$ is a PBFS.

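It remains to check that $B$ is the Banach kernel of $\mathcal{F}$, i.e., that

\[ \|(t_1,\ldots,t_n)\|_{B(z_1,\ldots,z_n)} = \|t_1 k_{z_1} + \cdots + t_n k_{z_n}\|_{\mathcal{F}^*} \tag{55} \]

for all $n = 1,2,\ldots$, all $(t_1,\ldots,t_n) \in (\mathbb{R} \setminus \{0\})^n$, and all distinct $z_1,\ldots,z_n \in Z$.
We can rewrite (55) as

\[ \|(t_1,\ldots,t_n)\|_{B(z_1,\ldots,z_n)} = \bigl\| \phi \mapsto \phi\bigl(t_1\Phi(z_1) + \cdots + t_n\Phi(z_n)\bigr) \bigr\|_{U^{**}}; \]

since $B$ is the Banach kernel of $\Phi$, this is equivalent to

\[ \|t_1\Phi(z_1) + \cdots + t_n\Phi(z_n)\|_U = \bigl\| \phi \mapsto \phi\bigl(t_1\Phi(z_1) + \cdots + t_n\Phi(z_n)\bigr) \bigr\|_{U^{**}}. \]

The last equality follows from the fact that the canonical imbedding of $U$ into
$U^{**}$ is an isometry ([?], §4.5). ∎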

Remark A Banach kernel $B$ on $Z$ can be visualized as a family $b(z_1,\ldots,z_n) \subseteq
\mathbb{R}^n$, $n$ ranging over $\{1,2,\ldots\}$ and $z_1,\ldots,z_n$ over sequences of distinct elements
of $Z$, of balanced convex sets containing a neighborhood of zero. Such a family
can be obtained from $B$ by replacing each seminorm $\|\cdot\|_{B(z_1,\ldots,z_n)}$ by the unit
ball in that seminorm; it is well known that the seminorm and the corresponding
unit ball carry the same information (see, e.g., [?], Theorems 1.34 and 1.35).
Of course, the sets $b(z_1,\ldots,z_n)$ should satisfy the two conditions of consistency
analogous to those in the definition of a Banach kernel; e.g., the second condition
becomes: for all $n = 1,2,\ldots$, all $k = 1,\ldots,n$, and all $(z_1,\ldots,z_n) \in Z^n$ whose
elements are all different, the set $b(z_1,\ldots,z_k)$ is the intersection of $b(z_1,\ldots,z_n)$
and the hyperplane $z_{k+1} = \cdots = z_n = 0$.

Now we can state more explicitly the prediction algorithm described above
and guaranteeing the bound of Theorem 1. Following (45) (with $\Phi$ defined by (50)), define


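\[
f_n(y,\mu) := \left( Y^{-2} \Bigl( \sum_{i=1}^{n-1} (y_i-\mu_i)\mu_i + (y-\mu)\mu \Bigr)^{2}
+ \|(y_1-\mu_1,\ldots,y_{n-1}-\mu_{n-1},\, y-\mu)\|_{B(x_1,\ldots,x_n)}^{2} \right)^{1/2}
\]
\[
- \left( Y^{-2} \Bigl( \sum_{i=1}^{n-1} (y_i-\mu_i)\mu_i \Bigr)^{2}
+ \|(y_1-\mu_1,\ldots,y_{n-1}-\mu_{n-1})\|_{B(x_1,\ldots,x_{n-1})}^{2} \right)^{1/2}. \tag{56}
\]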


This allows us to give the kernel representation of BBK29 with $\Phi$ defined by
(50): its parameter is a Banach kernel on the object space $X$.

Algorithm guaranteeing the bound of Theorem 1

Parameter: Banach kernel $B$ of $\mathcal{F}$
FOR $n = 1,2,\ldots$:

Read $x_n \in X$.

Define $f_n : [-Y,Y]^2 \to \mathbb{R}$ by (56).

Output any root $\mu \in [-Y,Y]$ of $f_n(-Y,\mu) = f_n(Y,\mu)$ as $\mu_n$;
if there are no such roots, output $\mu_n \in \{-Y,Y\}$
such that $\sup_{y \in [-Y,Y]} f_n(y,\mu_n) \le 0$.

Read $y_n \in [-Y,Y]$.

END FOR.